numeric feature
ECG Question Answering Combined With Electrocardiogram
Question answering (QA) in the field of healthcare has received much attention due to significant advancements in natural language processing. However, existing healthcare QA datasets primarily focus on medical images, clinical notes, or structured electronic health record tables. This leaves the vast potential of combining electrocardiogram (ECG) data with these systems largely untapped. To address this gap, we present ECG-QA, the first QA dataset specifically designed for ECG analysis. The dataset comprises a total of 70 question templates that cover a wide range of clinically relevant ECG topics, each validated by an ECG expert to ensure their clinical utility. As a result, our dataset includes diverse ECG interpretation questions, including those that require a comparative analysis of two different ECGs. In addition, we have conducted numerous experiments to provide valuable insights for future research directions. We believe that ECG-QA will serve as a valuable resource for the development of intelligent QA systems capable of assisting clinicians in ECG interpretations.
What Signals Really Matter for Misinformation Tasks? Evaluating Fake-News Detection and Virality Prediction under Real-World Constraints
Savatteri, Francesco Paolo, Vidal-Gorène, Chahan, Cafiero, Florian
We present an evaluation-driven study of two practical tasks in online misinformation, (i) fake-news detection and (ii) virality prediction, in operational settings that require rapid reaction. Using the EVONS and FakeNewsNet datasets, we compare textual embeddings (RoBERTa; with a control using Mistral) against lightweight numeric features (timing, follower counts, verification, likes) and sequence models (GRU, gating architectures, Transformer encoders). We show that textual content alone is a strong discriminator for fake-news detection, while numeric-only pipelines remain viable when language models are unavailable or compute is constrained. Virality prediction is markedly harder than fake-news detection and is highly sensitive to label construction; in our setup, a median-based "viral" split (<50 likes) is pragmatic but underestimates real-world virality, and time-censoring for engagement features is desirable yet difficult under current API limits. Dimensionality-reduction analyses suggest non-linear structure is more informative for virality than for fake-news detection (t-SNE > PCA on numeric features). Swapping RoBERTa for Mistral embeddings yields only modest deltas, leaving conclusions unchanged. We discuss implications for evaluation design and report reproducibility constraints that realistically affect the field. We release splits and code where possible and provide guidance for metric selection.
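The median-based label construction mentioned in the abstract can be illustrated with a minimal sketch. The function name `median_viral_labels`, the like counts, and the choice to label posts strictly above the median as viral are all assumptions made for illustration; the abstract does not specify the direction or tie handling of the split.

```python
from statistics import median

def median_viral_labels(like_counts):
    """Label a post 1 ("viral") if its like count is strictly above the median, else 0.

    The strict inequality is an assumption; the source only says the split is median-based.
    """
    threshold = median(like_counts)
    labels = [1 if likes > threshold else 0 for likes in like_counts]
    return threshold, labels

# Hypothetical like counts, invented for illustration only.
likes = [0, 3, 12, 48, 51, 300, 12000]
threshold, labels = median_viral_labels(likes)
print(threshold, labels)
```

In this toy sample a post with 51 likes and a post with 12000 likes receive the same label, which is one way to see why a median split may understate what would count as viral in practice.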
Artificial Intelligence for Cost-Aware Resource Prediction in Big Data Pipelines
Efficient resource allocation is a key challenge in modern cloud computing. Over-provisioning leads to unnecessary costs, while under-provisioning risks performance degradation and SLA violations. This work presents an artificial intelligence approach to predict resource utilization in big data pipelines using Random Forest regression. We preprocess the Google Borg cluster traces to clean, transform, and extract relevant features (CPU, memory, usage distributions). The model achieves high predictive accuracy (R² = 0.99, MAE = 0.0048, RMSE = 0.137), capturing non-linear relationships between workload characteristics and resource utilization. Error analysis shows strong performance on small-to-medium jobs, with higher variance on rare large-scale jobs. These results demonstrate the potential of AI-driven prediction for cost-aware autoscaling in cloud environments, reducing unnecessary provisioning while safeguarding service quality.
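For readers comparing the three reported metrics, the following sketch computes R², MAE, and RMSE from their standard definitions. The helper name `regression_metrics` and the sample values are hypothetical and are not taken from the Borg traces or the paper's model.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return R^2, MAE and RMSE computed from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "rmse": rmse}

# Hypothetical utilization values: four small jobs predicted exactly,
# one large job under-predicted.
y_true = [1.0, 2.0, 3.0, 4.0, 10.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because RMSE squares each error before averaging, a single large miss raises it well above MAE, so a wide gap between the two is usually a sign that a few predictions are far off while most are close.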
R2VF: A Two-Step Regularization Algorithm to Cluster Categories in GLMs
Over recent decades, extensive research has aimed to overcome the restrictive underlying assumptions required for a Generalized Linear Model to generate accurate and meaningful predictions. These efforts include regularizing coefficients, selecting features, and clustering ordinal categories, among other approaches. Despite these advances, efficiently clustering nominal categories in GLMs without incurring high computational costs remains a challenge. This paper introduces Ranking to Variable Fusion (R2VF), a two-step method designed to efficiently fuse nominal and ordinal categories in GLMs. By first transforming nominal features into an ordinal framework via regularized regression and then applying variable fusion, R2VF strikes a balance between model complexity and interpretability. We demonstrate the effectiveness of R2VF through comparisons with other methods, highlighting its performance in addressing overfitting and finding a proper set of covariates.
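The two-step idea (first place nominal levels on an ordinal scale, then fuse neighbouring levels) can be sketched in a deliberately simplified form. This is not the R2VF algorithm: the ridge-style shrunken means stand in for the regularized regression step, and the greedy threshold merge stands in for variable fusion. The function names, the `shrinkage` and `tolerance` parameters, and the data are all hypothetical.

```python
from collections import defaultdict

def rank_categories(categories, responses, shrinkage=1.0):
    """Step 1 (stand-in): order nominal levels by a shrunken mean effect."""
    overall = sum(responses) / len(responses)
    sums, counts = defaultdict(float), defaultdict(int)
    for level, y in zip(categories, responses):
        sums[level] += y
        counts[level] += 1
    # Ridge-style shrinkage of each level's deviation from the overall mean.
    effects = {
        level: (sums[level] - counts[level] * overall) / (counts[level] + shrinkage)
        for level in sums
    }
    return sorted(effects, key=effects.get), effects

def fuse_adjacent(ordered_levels, effects, tolerance):
    """Step 2 (stand-in): merge neighbouring levels whose effects differ by less than tolerance."""
    groups = [[ordered_levels[0]]]
    for prev, level in zip(ordered_levels, ordered_levels[1:]):
        if effects[level] - effects[prev] < tolerance:
            groups[-1].append(level)
        else:
            groups.append([level])
    return groups

# Hypothetical data: levels "a"/"b" behave alike, as do "c"/"d".
categories = ["a", "a", "b", "b", "c", "c", "d", "d"]
responses = [1.0, 1.2, 1.1, 1.3, 5.0, 5.2, 5.1, 4.9]
ordered, effects = rank_categories(categories, responses)
groups = fuse_adjacent(ordered, effects, tolerance=0.5)
print(groups)
```

Ordering the levels first means only adjacent pairs need to be compared in the second step, which is the intuition behind turning a nominal fusion problem into an ordinal one.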
Numeric Reward Machines
Levina, Kristina, Pappas, Nikolaos, Karapantelakis, Athanasios, Feljan, Aneta Vulgarakis, Seipp, Jendrik
Reward machines inform reinforcement learning agents about the reward structure of the environment and often drastically speed up the learning process. However, reward machines only accept Boolean features such as robot-reached-gold. Consequently, many inherently numeric tasks cannot profit from the guidance offered by reward machines. To address this gap, we aim to extend reward machines with numeric features such as distance-to-gold. For this, we present two types of reward machines: numeric-Boolean and numeric. In a numeric-Boolean reward machine, distance-to-gold is emulated by two Boolean features distance-to-gold-decreased and robot-reached-gold. In a numeric reward machine, distance-to-gold is used directly alongside the Boolean feature robot-reached-gold. We compare our new approaches to a baseline reward machine in the Craft domain, where the numeric feature is the agent-to-target distance. We use cross-product Q-learning, Q-learning with counter-factual experiences, and the options framework for learning. Our experimental results show that our new approaches significantly outperform the baseline approach. Extending reward machines with numeric features opens up new possibilities of using reward machines in inherently numeric tasks.
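The difference between the two machine types described in the abstract can be sketched as follows. The class names, the two-state layout, and every reward value are assumptions chosen for illustration; the abstract does not state the reward functions used in the paper.

```python
from dataclasses import dataclass

@dataclass
class NumericBooleanRM:
    """Emulates distance-to-gold with two Boolean features."""
    state: str = "searching"

    def step(self, distance_decreased: bool, reached_gold: bool) -> float:
        if self.state == "done":
            return 0.0
        if reached_gold:
            self.state = "done"
            return 1.0  # assumed terminal reward
        return 0.1 if distance_decreased else -0.1  # assumed shaping values

@dataclass
class NumericRM:
    """Uses distance-to-gold directly, alongside the Boolean goal feature."""
    state: str = "searching"

    def step(self, distance_to_gold: float, reached_gold: bool) -> float:
        if self.state == "done":
            return 0.0
        if reached_gold:
            self.state = "done"
            return 1.0  # assumed terminal reward
        return -distance_to_gold  # assumed: penalty proportional to distance

# Hypothetical trajectory of agent-to-gold distances.
distances = [3.0, 2.0, 3.0, 1.0, 0.0]
nb_rm, num_rm = NumericBooleanRM(), NumericRM()
nb_rewards, num_rewards = [], []
for prev, cur in zip(distances, distances[1:]):
    reached = cur == 0.0
    nb_rewards.append(nb_rm.step(cur < prev, reached))
    num_rewards.append(num_rm.step(cur, reached))
print(nb_rewards, num_rewards)
```

The numeric-Boolean machine only sees whether the distance shrank, while the numeric machine sees by how much, so the second variant can distinguish a small step from a large one.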
Counterfactual Generation with Answer Set Programming
Dasgupta, Sopam, Shakerin, Farhad, Arias, Joaquín, Salazar, Elmer, Gupta, Gopal
Machine learning models that automate decision-making are increasingly being used in consequential areas such as loan approvals, pretrial bail approval, hiring, and many more. Unfortunately, most of these models are black boxes, i.e., they are unable to reveal how they reach their predictions. The need for transparency demands justification for such predictions. An affected individual might also desire explanations to understand why a decision was made. Ethical and legal considerations may further require informing the individual of changes to the input attributes that could be made to produce a desirable outcome. This paper focuses on the latter problem of automatically generating counterfactual explanations. We propose a framework, Counterfactual Generation with s(CASP) (CFGS), that utilizes answer set programming (ASP) and the s(CASP) goal-directed ASP system to automatically generate counterfactual explanations from rules generated by rule-based machine learning (RBML) algorithms. In our framework, we show how counterfactual explanations are computed and justified by imagining worlds where some or all factual assumptions are altered. More importantly, we show how we can navigate between these worlds, namely, go from our original world/scenario where we obtain an undesired outcome to the imagined world/scenario where we obtain a desired/favourable outcome.
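The notion of moving from a world with an undesired outcome to an imagined world with a desired one can be illustrated with a small exhaustive search. This sketch does not use ASP or s(CASP) and is not the CFGS framework; it is a brute-force stand-in. The rule in `approve`, the attribute names, and their domains are hypothetical.

```python
from itertools import combinations, product

# Hypothetical attribute domains for a loan applicant.
DOMAINS = {
    "income": ["low", "high"],
    "credit": ["bad", "good"],
    "employed": [False, True],
}

def approve(applicant):
    """Hypothetical rule, standing in for rules learned by an RBML algorithm."""
    return applicant["credit"] == "good" and (
        applicant["income"] == "high" or applicant["employed"]
    )

def minimal_counterfactuals(original):
    """Return every smallest set of attribute changes that flips the decision to approved."""
    for size in range(1, len(DOMAINS) + 1):
        found = []
        for attrs in combinations(DOMAINS, size):
            # For each chosen attribute, consider only values that differ from the original.
            alternatives = [[v for v in DOMAINS[a] if v != original[a]] for a in attrs]
            for values in product(*alternatives):
                changes = dict(zip(attrs, values))
                if approve({**original, **changes}):
                    found.append(changes)
        if found:
            return found
    return []

original = {"income": "low", "credit": "bad", "employed": False}
counterfactuals = minimal_counterfactuals(original)
print(counterfactuals)
```

Each returned dictionary describes one imagined world that differs from the original in as few attributes as possible while yielding approval. Exhaustive enumeration grows quickly with the number of attributes, which is one reason a goal-directed reasoning system is attractive for this task.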
PARs: Predicate-based Association Rules for Efficient and Accurate Model-Agnostic Anomaly Explanation
Our user study shows that the anomaly explanation form of PARs is better understood and favoured by regular anomaly detection system users compared with existing model-agnostic anomaly explanation options. In our experiments, we demonstrate that it is significantly more efficient to find PARs than anchors (Ribeiro, Singh, and Guestrin 2018), another rule-based explanation, for identified anomaly instances. Moreover, PARs are also far more accurate than anchors for anomaly explanation, meaning that they have considerably higher precision and recall when applied as anomaly detection rules on unseen data other than the anomaly instance on which they were originally derived for explanation. Additionally, we show that PARs can also achieve higher accuracy on abnormal feature identification compared with many state-of-the-art model-agnostic explanation methods including LIME (Ribeiro, Singh, and Guestrin 2016), SHAP (Lundberg and Lee 2017), COIN
Anomaly detection, which aims to identify data instances that do not conform to the expected behavior, is a classic machine learning task with numerous applications in various domains including fraud detection, intrusion detection, predictive maintenance, etc. Over the past decades, numerous methods have been proposed to tackle this challenging problem. Examples include one-class classification-based (Manevitz and Yousef 2001; Ruff et al. 2018), nearest neighbor-based (Breunig et al. 2000), clustering-based (Jiang and An 2008), isolation-based (Liu, Ting, and Zhou 2012; Hariri, Kind, and Brunner 2019), density-based (Liu, Tan, and Zhou 2022; Feng and Tian 2021) and deep anomaly detection models based on autoencoders (Zhou and Paffenroth 2017; Zong et al. 2018), generative adversarial networks (Zenati et al. 2018; Han, Chen, and Liu 2021), to name a few.